Cluster analysis example
library(dplyr)
library(TeachingDemos)
library(KernSmooth)
library(MASS)
library(lattice)
library(ggplot2)
library(RColorBrewer)
library(viridis)
library(factoextra)
library(NbClust)
Analysis on demographic data in EU
We continue to analyze the dataset in EUdemo.txt (see
the previous example of PCA). In this study we aim to explore whether
countries can be grouped into clusters portraying different demographic
situations.
Remember that one row corresponds to the centroid of the first 15 EU countries.
- Eliminate it so that the analysis is carried out on individual countries only.
x <- read.table("../../EUdemo.txt", ..., ...)
n <- nrow(x)
p <- ncol(x)
ind_EU <- ...
x <- ...
# x <- x %>% slice(-ind_EU) # rownames are lost!
Data representation (What features should be used?)
When our information consists of a profile matrix, an important step in cluster analysis is the choice of variables: e.g., we can use the original variables or a transformation of them, from a simple standardization to PC scores or other possibilities.
Let’s decide whether to use raw or transformed data for clustering. To this aim, look at the distribution of the raw variables.
x %>% summary
NAT MOR CR.N MIG
Min. : 0.2254 Min. : 5.127 Min. :-6.3696 Min. :-10.9020
1st Qu.: 9.4017 1st Qu.: 8.778 1st Qu.:-1.1754 1st Qu.: -0.1701
Median :10.4273 Median : 9.889 Median : 0.4460 Median : 0.9284
Mean :10.8896 Mean :10.063 Mean : 0.8267 Mean : 1.1903
3rd Qu.:12.4762 3rd Qu.:10.890 3rd Qu.: 3.5132 3rd Qu.: 2.4504
CR.T M.IN VEC N.FI
Min. :-15.754 Min. : 2.400 Min. : 17.00 Min. :1.100
1st Qu.: -1.207 1st Qu.: 4.600 1st Qu.: 60.65 1st Qu.:1.290
Median : 1.826 Median : 5.850 Median : 78.45 Median :1.430
Mean : 2.017 Mean : 8.955 Mean : 74.00 Mean :1.509
3rd Qu.: 4.838 3rd Qu.:11.125 3rd Qu.: 88.60 3rd Qu.:1.732
E.P NU E.M V.M
Min. :24.70 Min. : 3.900 Min. :21.70 Min. :59.90
1st Qu.:26.60 1st Qu.: 4.800 1st Qu.:23.30 1st Qu.:68.28
Median :28.35 Median : 5.250 Median :25.95 Median :72.90
Mean :28.02 Mean : 5.673 Mean :25.65 Mean :71.56
3rd Qu.:29.32 3rd Qu.: 6.225 3rd Qu.:27.60 3rd Qu.:75.03
V.F
Min. :71.20
1st Qu.:75.35
Median :79.05
Mean :78.26
3rd Qu.:81.03
[ reached getOption("max.print") -- omitted 1 row ]
x %>% var %>% diag %>% sqrt
NAT MOR CR.N MIG CR.T M.IN
2.9578658 2.1624788 4.1705540 3.6036389 6.3627741 7.3494793
VEC N.FI E.P NU E.M V.M
24.7585519 0.3060433 1.7561905 1.5664737 2.3358947 4.6820333
V.F
3.2786714
- Finally, we decide to standardize the variables.
x <- ...
Here is the x dataset after processing (rendered as a table; in the HTML version, scroll it both horizontally and vertically).
| | NAT | MOR | CR.N | MIG | CR.T | M.IN | VEC | N.FI | E.P | NU | E.M | V.M | V.F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Austria | -0.4188946 | -0.1849450 | -0.2011949 | 0.3486411 | 0.0655819 | -0.6333782 | 0.7271225 | -0.6167428 | 0.0455531 | -0.4931459 | 0.5790073 | 0.7560818 | 0.8364669 |
| Belgio | 0.0938089 | 0.0946059 | 0.0174775 | 0.1065843 | 0.0718212 | -0.4973141 | 0.8482927 | 0.3308355 | 0.2733189 | -0.8761718 | 0.1937159 | 0.6920070 | 0.8669670 |
| Danimarca | 0.5234690 | 0.4901284 | 0.1171209 | 0.1598004 | 0.1672732 | -0.6469846 | 0.2666755 | 0.7229369 | 0.8996746 | 0.6559319 | 1.7348813 | 0.5638576 | 0.2264637 |
| Finlandia | 0.0883630 | -0.2398760 | 0.1870477 | -0.1476391 | 0.0389853 | -0.7286230 | 0.2989876 | 0.7556120 | 0.8996746 | -0.6208212 | 0.8786783 | 0.4784246 | 0.8364669 |
| Francia | 0.5750751 | -0.4155859 | 0.6233442 | -0.0955218 | 0.3544782 | -0.5653462 | 0.4161188 | 0.8536374 | 0.7288503 | -0.5569835 | 0.8358682 | 0.7133653 | 1.2634691 |
| Germania | -0.5079145 | 0.1133973 | -0.4190241 | 0.3528009 | -0.0748409 | -0.5789526 | 1.0785162 | -0.4860423 | 0.3302603 | -0.3016329 | 0.5361971 | 0.6279323 | 0.6839661 |
| Grecia | -0.4078752 | -0.1316862 | -0.2209948 | 0.3283008 | 0.0410838 | -0.4156757 | 1.5995483 | -0.6820930 | 0.3872017 | 0.1452307 | 0.3649565 | 0.8201565 | 0.6534659 |
| Irlanda | 1.1162565 | -0.7503568 | 1.1807467 | 1.1474068 | 1.4237826 | -0.4701013 | -0.9167539 | 1.2130637 | 1.4121475 | -0.4931459 | 1.0927291 | 0.4997829 | 0.2264637 |
| Italia | -0.5310467 | -0.0697143 | -0.3404846 | 0.1578244 | -0.1337887 | -0.5109205 | 2.0398002 | -0.9434940 | 1.1274403 | -0.5569835 | 0.8358682 | 0.9483059 | 1.1719686 |
| Lussemburgo | 0.6963896 | -0.5899557 | 0.7997963 | 2.6856328 | 2.0452784 | -0.8238679 | 0.0566471 | 0.7229369 | 0.7857917 | -0.5569835 | 0.7502479 | 0.6706488 | 0.8974672 |
| Paesi Bassi | 0.6032157 | -0.5444131 | 0.7101006 | 0.4383851 | 0.7137287 | -0.5109205 | -0.0362501 | 0.4615360 | 1.2982646 | 0.0175554 | 0.8786783 | 0.7987982 | 0.6839661 |
| Portogallo | 0.2446528 | 0.3419498 | -0.0037906 | -0.0524779 | -0.0322061 | -0.4564949 | 0.7109665 | -0.0612658 | 0.3302603 | 0.7836072 | 0.0652855 | 0.0939763 | 0.2569638 |
| Regno Unito | 0.2964898 | 0.2586459 | 0.0761674 | 0.4349457 | 0.2962619 | -0.4292821 | 0.3191826 | 0.5595613 | 0.2163774 | -0.3654706 | 0.5790073 | 0.7347235 | 0.4704650 |
| Spagna | -0.4412617 | -0.3080667 | -0.1532184 | -0.0486988 | -0.1280100 | -0.5517398 | 1.4783781 | -1.0088442 | 1.4690889 | -0.3016329 | 0.7930580 | 0.7347235 | 1.1414685 |
| Svezia | -0.3152388 | 0.2904594 | -0.3741818 | 0.0988957 | -0.1892510 | -0.7558359 | 0.7877076 | -0.0285907 | 1.0135575 | -1.0676847 | 1.7776915 | 1.1832466 | 1.1109683 |
| Albania | -3.6053965 | -2.2824675 | -1.3735563 | -1.3856389 | -1.6850878 | 2.9587130 | -2.2294317 | 3.5656719 | 0.1594360 | 0.9751201 | -1.1762089 | -0.6535622 | -0.8715421 |
| Bielorussia | -0.5682678 | 1.8487202 | -1.3616126 | -3.3555729 | -2.7929566 | 0.3462830 | -0.1533813 | -0.7147681 | -1.7765726 | 0.9751201 | -1.4758799 | -1.9991314 | -1.2985443 |
| Bosnia-Erzegovina | 0.7486665 | -1.1602882 | 1.1325962 | -1.5699589 | -0.1467931 | 0.2782510 | -1.8376479 | 0.1674600 | -1.1502169 | 0.4005813 | -1.0049683 | -0.3972633 | -0.9325424 |
| Bulgaria | -0.7042754 | 1.6438904 | -1.3518662 | -0.3302932 | -1.0731622 | 0.8089008 | 1.1229453 | -0.9108188 | -1.8904555 | -0.8761718 | -0.9193480 | -0.6962787 | -0.9630425 |
| Cipro | 0.6277560 | -1.1536278 | 1.0433899 | -0.5513626 | 0.3716307 | -0.4020693 | -1.0258072 | 1.0823632 | 0.3302603 | 4.5500286 | -0.0203348 | 0.7987982 | 0.6534659 |
| Croazia | -0.1682199 | 0.6827745 | -0.4733322 | -0.5996902 | -0.6498934 | -0.1707604 | -0.4643850 | -0.4206921 | -0.1252711 | -0.3016329 | -0.2343856 | -0.3545468 | -0.4445398 |
| Estonia | -0.7518377 | 1.2774651 | -1.1956030 | -0.4264828 | -1.0252159 | 0.0877613 | 0.2666755 | -0.8781437 | -0.8085683 | -1.1315224 | -0.4912465 | -1.2943095 | -0.5970406 |
| Islanda | 1.3161916 | -1.4854988 | 1.7037253 | 0.7702909 | 1.5529906 | -0.8918999 | -0.9854171 | 1.5724899 | 0.3872017 | -0.0462823 | 1.7776915 | 1.3327543 | 0.9889677 |
| Iugoslavia | 0.5205388 | 0.2221207 | 0.2540076 | -0.3433779 | -0.0279842 | 0.6592304 | -0.4603460 | 0.8536374 | -0.6946855 | -0.2377953 | -0.6196769 | -0.3331886 | -1.0240429 |
| Lettonia | -0.9843885 | 1.6031229 | -1.5293913 | -0.5471147 | -1.3123227 | 0.3598894 | 0.3312997 | -1.1395446 | -0.6946855 | -1.1315224 | -0.6196769 | -1.4438172 | -0.8715421 |
| Lituania | -0.3552825 | 0.3463268 | -0.4315499 | -0.2327850 | -0.4147052 | -0.0483027 | -0.2624346 | -0.5187175 | -0.8655097 | -0.5569835 | -1.0905886 | -0.9739358 | -0.3225392 |
| Macedonia | 1.2358387 | -0.7605571 | 1.2708465 | -0.6067469 | 0.4893521 | 0.7816880 | -1.3166158 | 0.8209623 | -1.0363340 | 0.8474448 | -1.0049683 | -0.2477556 | -1.1460435 |
| Malta | 0.3490564 | -0.9779687 | 0.7546477 | 0.1108205 | 0.5574072 | -0.2387924 | -0.6663354 | 0.9843378 | 0.5010846 | 0.4005813 | 0.4505768 | 0.5211411 | 0.5619654 |
| Moldova | -0.2512903 | 0.6210481 | -0.5002419 | -0.7251229 | -0.7385720 | 1.2170930 | -1.4539421 | 0.5268862 | -1.5488069 | 0.5282566 | -1.6899307 | -1.5719666 | -2.0610483 |
| Norvegia | 0.8115669 | 0.0310710 | 0.5594739 | 0.8513534 | 0.8488886 | -0.6878038 | 0.0929982 | 1.0823632 | 0.7288503 | -0.2377953 | 1.0927291 | 0.8628730 | 0.8669670 |
| Polonia | -0.3410263 | -0.0913270 | -0.1945107 | -0.4307829 | -0.3714738 | -0.0210899 | -0.4966971 | -0.4533672 | -0.4669197 | 0.0175554 | -1.0049683 | -0.7176369 | -0.3225392 |
| Repubblica ceca | -0.7392766 | 0.2839451 | -0.6715430 | -0.0928359 | -0.4927494 | -0.5925590 | 0.3676507 | -1.2375700 | -0.6377440 | -0.3016329 | -0.6624871 | -0.0341732 | -0.0175376 |
| Romania | -0.1521408 | 0.8039003 | -0.5247334 | -0.3611646 | -0.5484928 | 1.2987315 | -0.1129913 | -0.6820930 | -1.3779826 | 0.3367436 | -1.0477784 | -0.9525776 | -1.2680441 |
| Russia | -0.8677057 | 2.1409561 | -1.7255093 | -0.0237870 | -1.1444772 | 1.0810290 | -0.2260835 | -1.1068695 | -1.3210412 | 0.3367436 | -1.5186901 | -2.4903710 | -1.7865469 |
| San Marino | 0.1457497 | -1.1633592 | 0.7065842 | 2.8111854 | 2.0552898 | -0.7694423 | 1.3047007 | -0.6820930 | 2.3801518 | 1.9326849 | 1.1783494 | 1.8880686 | 1.3244694 |
| Slovacchia | -0.1604934 | -0.1628653 | -0.0293786 | -0.2582975 | -0.1655466 | -0.1027284 | -0.6663354 | -0.5840677 | -0.9224512 | -0.3654706 | -1.0477784 | -0.5467710 | -0.3225392 |
| Slovenia | -0.6980148 | -0.1760062 | -0.4037891 | 1.2230198 | 0.4280048 | -0.6469846 | 0.4847820 | -0.9761691 | -0.0113883 | -1.1315224 | 0.2793362 | 0.0512598 | 0.3179642 |
| Svizzera | 0.0286267 | -0.6077510 | 0.3354284 | 0.6407987 | 0.5827850 | -0.5925590 | 0.5494061 | -0.0939410 | 0.9566160 | 0.0175554 | 0.8786783 | 1.1191719 | 1.2939693 |
| Turchia | 3.4495426 | -1.7624610 | 3.3603616 | -0.1510413 | 2.1170437 | 3.9383742 | -2.3021338 | -1.3355953 | -0.8085683 | 0.3367436 | -1.3046393 | -1.0807270 | -2.1525488 |
| Ungheria | -0.5014068 | 1.9242497 | -1.3533557 | -0.3302932 | -1.0741385 | -0.0755156 | 0.4645869 | -0.7147681 | -0.5238612 | -0.7484965 | -0.6196769 | -1.1020853 | -0.9325424 |
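The x <- ... blank above asks for standardization. A minimal sketch of one plausible completion, demonstrated on a synthetic matrix rather than the EU data (the names demo and z are illustrative only):

```r
# Hypothetical completion of the standardization step, shown on synthetic
# data (not the EU dataset): scale() centres each column to mean 0 and
# rescales it to standard deviation 1
set.seed(1)
demo <- matrix(rnorm(60, mean = 10, sd = 3), nrow = 20,
               dimnames = list(NULL, c("A", "B", "C")))
z <- as.data.frame(scale(demo))
round(colMeans(z), 10)   # all 0
apply(z, 2, sd)          # all 1
```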
From preliminary exploration to a more focused assessment of clustering tendency
Icons
A preliminary exploration of the similarity between countries can be done with icon plots.
- Let’s see the stars graph, in both versions: “star” and “colourful wheel”.
stars(x, scale = TRUE, key.loc = NULL, draw.segments = FALSE, ...)
See ?stars.
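For a concrete, runnable illustration of the two icon versions, here is the same call on the built-in mtcars data as a stand-in for x:

```r
# Star version vs. segment ("colourful wheel") version, drawn on a
# built-in dataset as a stand-in for the EU data
dat <- mtcars[1:12, 1:7]
stars(dat, scale = TRUE, key.loc = c(14, 2))                        # stars
stars(dat, scale = TRUE, draw.segments = TRUE, key.loc = c(14, 2))  # wheels
```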
Another icon graph is given by the Chernoff faces.
- Let’s see the faces routine of the TeachingDemos pkg.
faces(xy, scale = TRUE, ...)
See ?faces.
Parallel coordinates
A graph that gives a dual view of the data matrix is the parallel coordinate plot.
Let’s look at various routines for parallel plots in the MASS and lattice pkgs and, moreover, in the iplots pkg for interactive graphics.
- See ?MASS::parcoord, ?lattice::parallelplot, ?iplots::ipcp.
parcoord(x)
parcoord(x, col = 1:n)
parallelplot(x)
# Loading iplots is unnecessary if you load JGR, which is recommended
# when working with iplots
library(iplots)
ipcp(x)
# Loading iplots is unnecessary if you load JGR, which is recommended
# when working with iplots
library(JGR)
JGR()
# type commands in the editor window of JGR and press the 'Enter' key, e.g.,
# an interactive parallel coordinates plot
ipcp(x)
Plots on reduced data
The plot of the scores on the first principal components can give a view of the similarity between countries in a reduced space.
- Let’s obtain the first two PCs by using prcomp or princomp and plot the cloud of scores. See ?prcomp, ?princomp.
pcdemo <- prcomp(x)
pc <- pcdemo$x[, 1:2]
plot(pc)
...
...
We can get a clearer view of any clustering structure by estimating the density of the bivariate distribution of the first 2 PCs.
Let’s use the bkde2D routine in the KernSmooth pkg and then use contour or persp for plotting the density.
est <- bkde2D(pc, bandwidth = c(0.5, 0.5), gridsize = c(20, 20))
contour(est$x1, est$x2, est$fhat, main = "Empirical 2D-density ...", col = 4)
persp(est$fhat, main = "... of the first 2 PC", col = "green3")
How many modes do you see?
We can also use the compound code offered by ggplot2 library.
pc <- pc %>% as.data.frame
p1 <- ggplot(pc, aes(x = PC1, y = PC2)) + stat_density2d(aes(fill = ..density..),
geom = "tile", contour = F)
p2 <- p1 + scale_fill_distiller(palette = "RdGy")
p3 <- p1 + scale_fill_viridis(option = "inferno")
ggpubr::ggarrange(p1, p2, p3, ncol = 3)
Let’s turn back to the parallel coordinates:
- (a) detect the possible anomalous profile of Turchia; (b) detect how many components are useful to discriminate between units (and, indirectly, choose the number of components to keep).
Ordered Dissimilarity Matrix
- Compute the dissimilarity matrix (DM) between the objects in the data set using the Euclidean distance measure.
- Reorder the DM so that similar objects are close to one another. This process creates an ordered dissimilarity matrix (ODM).
See ?fviz_dist in the factoextra library.
fviz_dist(dist(x), ...)
- Can you detect a clustering tendency from the ODM graph?
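If factoextra is unavailable, a base-R sketch of the same idea reorders the distance matrix by the leaf order of a dendrogram and draws it with image (shown on scaled USArrests as a stand-in dataset):

```r
# Ordered dissimilarity image in base R, on a stand-in dataset
xs  <- scale(USArrests)
dd  <- dist(xs)
ord <- hclust(dd)$order                 # leaf order of a complete-linkage tree
odm <- as.matrix(dd)[ord, ord]          # rows and columns reordered alike
image(odm, main = "Ordered dissimilarity matrix", axes = FALSE)
```

Blocks of small distances along the diagonal suggest a clustering tendency.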
Hierarchical methods
Hierarchical methods need a distance matrix as input, so the first decision concerns the type of pairwise distance. We consider the most popular one: the Euclidean distance.
Let’s see the dist routine and choose the Euclidean distance for computing the distance matrix.
d <- dist(x)
d
d %>% as.matrix
x[1:2, ]
NAT MOR CR.N MIG CR.T
Austria -0.41889459 -0.18494502 -0.20119485 0.3486411 0.06558189
Belgio 0.09380886 0.09460592 0.01747745 0.1065843 0.07182119
M.IN VEC N.FI E.P NU E.M
Austria -0.6333782 0.7271225 -0.6167428 0.04555314 -0.4931459 0.5790073
Belgio -0.4973141 0.8482927 0.3308355 0.27331887 -0.8761718 0.1937159
V.M V.F
Austria 0.7560818 0.8364669
Belgio 0.6920070 0.8669670
x[1:2, ] %>% dist
Austria
Belgio 1.315535
d[1]
[1] 1.315535
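As a quick sanity check of what dist computes, the Euclidean distance between two toy vectors (not the EU data) by hand and via dist:

```r
# Euclidean distance by hand vs. dist() on toy vectors
a <- c(1, 2, 3)
b <- c(4, 6, 3)
by_hand <- sqrt(sum((a - b)^2))   # sqrt(9 + 16 + 0) = 5
by_dist <- dist(rbind(a, b))[1]
c(by_hand, by_dist)               # both equal 5
```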
as.matrix(d)[1:7, 1:7] %>% round(2)
Austria Belgio Danimarca Finlandia Francia Germania Grecia
Austria 0.00 1.32 2.70 1.91 2.23 0.68 1.19
Belgio 1.32 0.00 2.53 1.31 1.52 1.36 1.78
Danimarca 2.70 2.53 0.00 1.89 2.15 2.57 2.80
Finlandia 1.91 1.31 1.89 0.00 0.94 1.97 2.40
Francia 2.23 1.52 2.15 0.94 0.00 2.41 2.65
Germania 0.68 1.36 2.57 1.97 2.41 0.00 0.85
Grecia 1.19 1.78 2.80 2.40 2.65 0.85 0.00
As for the type of algorithm, we consider a bottom-up (agglomerative) hierarchical algorithm. Let’s see the best-known methods for the distance between groups.
Nearest-neighbour or single linkage method
ls <- hclust(d, method = "single")
ls %>% names
[1] "merge" "height" "order" "labels" "method"
[6] "call" "dist.method"
ls %>% plot
- Comment on the form of the clusters and on the information generally provided by the nearest-neighbour linkage.
Let’s try to identify the best level at which to cut the dendrogram by looking for the highest separation between clusters. Use rect.hclust to highlight the best partition.
m <- length(ls$height)
m
[1] 39
s <- c()
for (j in m:2) s <- c(s, ls$height[j]/ls$height[j - 1])
length(s)
[1] 38
max(s)
[1] 1.472332
b <- (m:2)[s == max(s)]
b
[1] 35
ls$height[b]/ls$height[b - 1]
[1] 1.472332
k <- (2:m)[s == max(s)]
k
[1] 6
rect.hclust(ls, k = k, which = k, border = 4)
It is wise to detect the 2nd-best level, and so on (stop at your choice).
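The height-ratio loop above can be wrapped in a small helper and checked on deterministic toy data with an obvious two-group structure (best_k is an illustrative name, not part of the original code):

```r
# Height-ratio heuristic from the loop above, as a function
best_k <- function(hc) {
  m <- length(hc$height)
  ratios <- hc$height[m:2] / hc$height[(m - 1):1]  # same s as in the loop
  (2:m)[which.max(ratios)]                         # k with the biggest jump
}
# two tight groups on a line, far apart: the best cut must give k = 2
toy <- cbind(c(0, 0.1, 0.2, 10, 10.1, 10.2), 0)
best_k(hclust(dist(toy), method = "single"))       # 2
```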
The second-best is
k2 <- (2:m)[s == max(s[-(k - 1)])]
k2
[1] 3
The 3-partition, the most relevant after the one shown above, separates Turchia and then Albania from the rest of Europe. Again, single linkage makes anomalous units stand out.
Furthest-neighbour or complete linkage method
lc <- hclust(d)
plot(lc)
- Comment on the form of the clusters and on the information generally provided by the furthest-neighbour linkage.
s <- c()
for (j in m:2) s <- c(s, lc$height[j]/lc$height[j - 1])
k <- (2:m)[s == max(s)]
k
[1] 30
k2 <- (2:m)[s == max(s[-(k - 1)])]
k3 <- (2:m)[s == max(s[-c(k - 1, k2 - 1)])]
k4 <- (2:m)[s == max(s[-c(k - 1, k2 - 1, k3 - 1)])]
k5 <- (2:m)[s == max(s[-c(k - 1, k2 - 1, k3 - 1, k4 - 1)])]
k2, k3, k4, k5: 3, 12, 4, 6
The 3-partition appears to be relevant: on one side Albania, Moldova,
etc.; on the other Cipro, … UE …, Norvegia, San Marino, Islanda …; and,
distant from all countries, Turchia.
Continue to analyse the hierarchical clustering with Average linkage, Centroid method, and Ward method.
Average Linkage method
lm <- hclust(d, method = "ave")
k, k2, k3: 6, 3, 9
- Compare the results of the average linkage method to the previous ones.
Centroid method
lcen <- hclust(d, method = "cen")
k, k2, k3: 6, 3, 30
- What do you notice about the distances at which fusions occur? Do you note any similarity with previous results?
Ward method
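The code chunk for this method is not shown in the source; a plausible reconstruction follows, run on scaled USArrests as a stand-in for d. In current R the Ward criterion on a Euclidean distance matrix is method = "ward.D2"; older scripts used method = "ward".

```r
# Assumed reconstruction of the missing Ward chunk, on a stand-in distance
# matrix; lw and d_demo are illustrative names
d_demo <- dist(scale(USArrests))
lw <- hclust(d_demo, method = "ward.D2")
plot(lw)
```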
k, k2, k3: 2, 3, 12
Let’s conclude by summarising what each method tells us and what the most promising cluster numbers are.
k1 k2 k3 k4 k5
sl 6 3 31 2 7
cl 30 3 12 4 6
al 6 3 9 8 30
cen 6 3 30 8 9
w 2 3 12 30 8
Partitional methods
The kmeans routine
We consider the most popular partitional method: K-means. We need to fix the cluster number K a priori. Let’s start with K = 6.
We can take a cluster partition previously obtained by hierarchical clustering, or we can let R choose randomly the initial set.
See ?cutree.
kmeans(x, centers, iter.max = 10, nstart = 1, ...)
See ?kmeans.
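The whole initialize-kmeans-from-a-tree workflow can be run end-to-end on a stand-in dataset (scaled USArrests with k = 3; all object names here are illustrative):

```r
# Tree-derived centroids used as kmeans starting centers, on a stand-in
# dataset instead of the EU data
xs   <- scale(USArrests)
hc   <- hclust(dist(xs), method = "average")
cl0  <- cutree(hc, k = 3)                                    # initial partition
init <- aggregate(as.data.frame(xs), list(cl0), mean)[, -1]  # its centroids
km   <- kmeans(xs, centers = init, iter.max = 100)
km$size                                                      # sizes sum to 50
```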
- Plot the first two PC scores and colour points accordingly to the initial clusters of membership.
lm_cl6 <- cutree(lm, k = 6)
plot(pc, ...)
...
...
Then, we run kmeans using the initial set derived above.
init6 <- aggregate(x, list(lm_cl6), mean)
init6
Group.1 NAT MOR CR.N MIG CR.T
1 1 -0.00141097 0.1289484 -0.06786191 0.07521229 -0.001883425
2 2 -3.60539653 -2.2824675 -1.37355634 -1.38563891 -1.685087825
3 3 -0.56826784 1.8487202 -1.36161257 -3.35557293 -2.792956587
4 4 0.62775602 -1.1536278 1.04338993 -0.55136257 0.371630744
5 5 0.14574969 -1.1633592 0.70658422 2.81118544 2.055289802
M.IN VEC N.FI E.P NU E.M
1 -0.1734817 0.1258872 -0.05473082 -0.00813449 -0.2505628 0.07996324
2 2.9587130 -2.2294317 3.56567194 0.15943601 0.9751201 -1.17620886
3 0.3462830 -0.1533813 -0.71476814 -1.77657263 0.9751201 -1.47587991
4 -0.4020693 -1.0258072 1.08236319 0.33026030 4.5500286 -0.02033482
5 -0.7694423 1.3047007 -0.68209303 2.38015179 1.9326849 1.17834937
V.M V.F
1 0.02990154 0.06699142
2 -0.65356220 -0.87154206
3 -1.99913144 -1.29854430
4 0.79879825 0.65346592
5 1.88806858 1.32446943
[ reached getOption("max.print") -- omitted 1 row ]
km6 <- kmeans(x, init6[, -1], 100)
names(km6)
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
Play with the components of the output to discover their content.
km6$iter
km6$ifault
km6$size
km6$centers
km6$tot.w
fitted(km6)
fitted(km6, method = "classes")
- Now, plot the first two PC scores and colour points according to the obtained partition.
plot(pc, ...)
...
...
The 6-partition obtained by kmeans is significantly different from that obtained by the hierarchical method.
Remember that outliers are likely present in our data. Try larger cluster numbers to isolate them.
In the case of K = 6, cluster sizes and outliers are respectively:
km6 ...
... ...
[1] 14 1 14 3 7 1
[1] "Albania" "Turchia"
The kmeans function has an nstart option that attempts multiple initial configurations and reports the best one. This approach is often recommended.
km6best <- kmeans(x, 6, 100, 25)
Is the final clustering (right panel) the same as the previous grouping (left panel)?
Sizes and outliers are
[1] 8 1 14 6 1 10
[1] "Albania" "Turchia"
km6$tot.w
[1] 144.0562
km6best$tot.w
[1] 142.8716
In any case, try different cluster numbers and evaluate whether the results vary with the initialization.
E.g., let’s try a larger cluster number, K = 9.
lm_cl9 <- cutree(lm, k = 9)
init9 <- aggregate(x, list(lm_cl9), mean)
km9 <- kmeans(x, init9[, -1], 100)
Has the increase of the cluster number achieved the expected result?
Cluster sizes and outliers are respectively:
[1] 14 6 1 3 3 10 1 1 1
[1] "Albania" "Cipro" "San Marino" "Turchia"
The comparison between the 9-partition obtained from the tree-derived
initialization and the best 9-partition selected among multiple random
initializations is below. Is the final clustering (right panel) the same as the previous grouping (left panel)?
Sizes and outliers are
[1] 1 5 1 1 1 11 1 9 10
[1] "Albania" "Bielorussia" "Cipro" "San Marino" "Turchia"
km9$tot.w
[1] 88.79624
km9best$tot.w
[1] 89.9779
However, the classification appears rather unstable.
Let’s try K = 3.
lm_cl3 <- cutree(lm, k = 3)
init3 <- aggregate(x, list(lm_cl3), mean)
km3 <- kmeans(x, init3[, -1], 100)
Greater stability is obtained by choosing K = 3.
km3best <- kmeans(x, 3, 100, 25)
Can you notice any coincidence with the partitions obtained by hierarchical clustering?
As regards the possible influence of outliers, try different
distance metrics (e.g. the Manhattan distance). Remember that kmeans
uses only the Euclidean distance. (Refer to the cluster library for more algorithms and methods.)
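For instance, the cluster package (a recommended package shipped with R) offers pam, a medoid-based method that accepts the Manhattan metric; here on scaled USArrests as a stand-in dataset:

```r
# k-medoids with the Manhattan metric, as an outlier-robust alternative
# to kmeans; shown on a stand-in dataset
library(cluster)
pm <- pam(scale(USArrests), k = 3, metric = "manhattan")
pm$medoids            # one representative (actual) row per cluster
table(pm$clustering)
```

Because each centre is an observed unit, single outliers pull medoids far less than they pull means.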
More on cluster number selection
In order to select the best K(s), let’s consider one of the several indexes measuring the quality of a clustering.
E.g., we consider the ratio of between-cluster deviance to total deviance, which is a measure of the separation between clusters.
A possible procedure is
kmax <- 30
for (g in 2:kmax) assign(paste0("km", g), kmeans(x, g, 100, 10000))
betweenss <- vector(length = kmax - 1)
for (g in 2:kmax) betweenss[g - 1] <- get(paste0("km", g))[["betweenss"]]
totss <- km2$totss
plot(2:kmax, betweenss/totss, type = "b", pch = 20, xlab = "cluster #", cex = 0.65)
Can you see a “knee” in the plot?
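The assign()/get() pattern above works, but storing the fits in a list is tidier; the same curve on a stand-in dataset:

```r
# Same between/total deviance curve without assign()/get(), using a list
# of fits; stand-in dataset instead of the EU data
set.seed(3)
xs    <- scale(USArrests)
kmax  <- 10
fits  <- lapply(2:kmax, function(g) kmeans(xs, g, iter.max = 100, nstart = 50))
ratio <- sapply(fits, function(f) f$betweenss / f$totss)
plot(2:kmax, ratio, type = "b", pch = 20, xlab = "cluster #")
```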
The package NbClust provides 30 indexes for determining the optimal number of clusters in the data.
NbClust(data = NULL, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 15, method = NULL, index = "all", ...)
- Run NbClust separately for each index and plot the results (the overall run cannot be carried out because of indefinite results for some indexes).
Krzanowski and Lai (1988)
nc <- NbClust(x, method = "kmeans", index = "kl")
nc
$All.index
2 3 4 5 6 7 8 9 10
2.5776 2.4954 0.3920 2.6342 0.8426 1.7706 0.4228 38.5523 0.0425
11 12 13 14 15
1.1385 7.1692 0.1656 4.9314 1.7001
$Best.nc
Number_clusters Value_Index
9.0000 38.5523
$Best.partition
Austria Belgio Danimarca Finlandia
2 2 2 2
Francia Germania Grecia Irlanda
2 2 2 5
Italia Lussemburgo Paesi Bassi Portogallo
2 5 5 2
Regno Unito Spagna Svezia Albania
2 2 2 4
Bielorussia Bosnia-Erzegovina Bulgaria Cipro
3 9 3 1
Croazia Estonia Islanda Iugoslavia
6 3 5 9
Lettonia Lituania Macedonia Malta
3 6 9 5
Moldova Norvegia Polonia Repubblica ceca
9 5 6 6
Romania Russia San Marino Slovacchia
6 3 8 6
Slovenia Svizzera Turchia Ungheria
2 2 7 3
Calinski and Harabasz (1974)
Hartigan (1975)
Sarle (1975)
Scott and Simons (1971)
Marriot (1971)
Milligan and Cooper (7 and 8) (1975)
Friedman and Rubin (9 and 10) (1967) NO
Hubert and Levin (1976)
Davies and Bouldin (1979)
Rousseeuw (1987)
Duda and Hart (1973)
Duda and Hart (1973)
Beale (1969)
Warning in pf(beale, pp, df2): NaNs produced
Ratkowsky and Lance (1978)
Ball and Hall (1965)
Milligan (1981)
Tibshirani et al. (2001)
Frey and Van Groenewoud (1972)
McClain and Rao (1975)
Baker and Hubert (1975)
Rohlf (1974) Milligan (1981)
Dunn (1974)
Hubert and Arabie (1975) NO
Halkidi et al. (2000)
Lebart et al. (2000) NO
Halkidi and Vazirgiannis (2001)
Numbers of clusters chosen by 20 criteria:
Results
Let’s show the results for the selected clustering partitions.
From the above internal validation, the 2-, 3- and 9-group partitions appear interesting. We show the results for the 2- and 3-group clusterings.
Cluster sizes and units of the best 2-group partition are respectively:
km2$size
[1] 18 22
for (h in 1:2) cat("cluster ", h, "\n", row.names(x)[km2$cluster == h], "\n")
cluster 1
Albania Bielorussia Bosnia-Erzegovina Bulgaria Croazia Estonia Iugoslavia Lettonia Lituania Macedonia Moldova Polonia Repubblica ceca Romania Russia Slovacchia Turchia Ungheria
cluster 2
Austria Belgio Danimarca Finlandia Francia Germania Grecia Irlanda Italia Lussemburgo Paesi Bassi Portogallo Regno Unito Spagna Svezia Cipro Islanda Malta Norvegia San Marino Slovenia Svizzera
Cluster sizes, units, and centroids of the best 3-group partition are respectively:
km3$size
[1] 21 14 5
for (h in 1:3) cat("cluster ", h, "\n", row.names(x)[km3$cluster == h], "\n")
cluster 1
Austria Belgio Danimarca Finlandia Francia Germania Grecia Irlanda Italia Lussemburgo Paesi Bassi Portogallo Regno Unito Spagna Svezia Islanda Malta Norvegia San Marino Slovenia Svizzera
cluster 2
Bielorussia Bulgaria Croazia Estonia Iugoslavia Lettonia Lituania Moldova Polonia Repubblica ceca Romania Russia Slovacchia Ungheria
cluster 3
Albania Bosnia-Erzegovina Cipro Macedonia Turchia
NAT MOR CR.N MIG CR.T M.IN
1 0.1699365 -0.2869012 0.2692849 0.5867785 0.5088355 -0.5906152
2 -0.4303623 0.9388805 -0.7920443 -0.5755429 -0.8451207 0.3462830
3 0.4912815 -1.4238803 1.0867276 -0.8529497 0.2292291 1.5109914
VEC N.FI E.P NU E.M V.M
1 0.4978607 0.1752398 0.7695228 -0.1739576 0.8358682 0.7662524
2 -0.1245313 -0.5700641 -0.9753254 -0.2469149 -0.9315794 -1.0364850
3 -1.7423273 0.8601724 -0.5010846 1.4219837 -0.9022239 -0.3161020
V.F
1 0.7943476
2 -0.8737206
3 -0.8898422
Appendix
Some of the graphics introduced in the exploratory phase are presented again after carrying out the cluster analysis.